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Abstract 

Background: Genetic association studies are conducted to discover genetic loci that contribute to an inherited trait, 
identify the variants behind these associations and ascertain their functional role in determining the phenotype. To 
date, functional annotations of the genetic variants have rarely played more than an indirect role in assessing 
evidence for association. Here, we demonstrate how these data can be systematically integrated into an association 
study's analysis plan. 

Results: We developed a Bayesian statistical model for the prior probability of phenotype-genotype association that 
incorporates data from past association studies and publicly available functional annotation data regarding the 
susceptibility variants under study. The model takes the form of a binary regression of association status on a set of 
annotation variables whose coefficients were estimated through an analysis of associated SNPs in the GWAS Catalog 
(GC). The functional predictors examined included measures that have been demonstrated to correlate with the 
association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC 
Human Genome Browser ENCODE super-track data, dbSNP function class, sequence conservation summaries, 
proximity to genomic variants in the Database of Genomic Variants and known regulatory elements in the Open 
Regulatory Annotation database, PolyPhen-2 probabilities and RegulomeDB categories. Because we expected that 
only a fraction of the annotations would contribute to predicting association, we employed a penalized likelihood 
method to reduce the impact of non-informative predictors and evaluated the model's ability to predict GC SNPs not 
used to construct the model. We show that the functional data alone are predictive of a SNP's presence in the GC. 
Further, using data from a genome-wide study of ovarian cancer, we demonstrate that their use as prior data when 
testing for association is practical at the genome-wide scale and improves power to detect associations. 

Conclusions: We show how diverse functional annotations can be efficiently combined to create 'functional 
signatures' that predict the a priori odds of a variant's association to a trait and how these signatures can be integrated 
into a standard genome-wide-scale association analysis, resulting in improved power to detect truly associated 
variants. 
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Background 

The purpose of genetic association studies is to discover 
genetic loci that contribute to an inherited trait, identify 
the variants behind these associations and ascertain their 
functional role in determining the phenotype [1]. Mod- 
ern association studies bring to bear on this problem 
high coverage genotype data, comprehensive databases of 
genetic variation that allow imputation of most common 
ungenotyped variants to high accuracy and extensive, 
publicly available, in silico resources housing a growing 
assortment of genomic data that allow functional charac- 
terization of vast regions of the human genome. In the 
typical genome-wide association study (GWAS), the first 
two forms of data are combined to reconstruct genotypes 
to a desired density and these genotypes are then sys- 
tematically tested for association with the phenotype. The 
functional annotation data are most frequently used in 
post hoc interpretation of evident associations raised by 
the analysis [2]. 

To date, functional annotation data have rarely played 
more than an indirect role in assessing evidence for asso- 
ciation. For example, they may be used to suggest candi- 
date genes and SNPs for study or to support links between 
candidate SNPs and genes. While methods to incorporate 
functional annotation data a priori in genetic association 
analyses exist, they are infrequently used. The prevailing 
approach to this is via a two-staged hierarchical model in 
which coefficients in the stage I generalized linear model 
for phenotype given genotype and exposure measure- 
ments are regressed, in stage II, on the annotation data 
[3-6]. This is limited to analysis of a modest number of 
variants and does not make use of prior data derived from 
previous association studies to inform the nature of that 
relationship. 

It is becoming increasingly clear that a widening array 
of annotation data correlates with a variant's having been 
associated with a human phenotype [7-14]. In what fol- 
lows, we describe a formal approach to inference for 
association that combines functional annotation data 
(through a prior distribution) with genotype data (through 
a sampling model for the phenotype given genetic and 
other covariate data). We construct the prior distribution 
through careful analysis of SNPs housed in the GWAS 
Catalog [8]. We refer to the linear combination of the 
annotation variables defined by this model and evaluated 
for a given SNP as its 'functional annotation signature'. 
We show that functional signatures so derived are pre- 
dictive of the association status of SNPs not used in their 
creation and that, when coupled with genetic association 
data following the method we describe, improve the effi- 
ciency of association testing in a GWAS study of ovarian 
cancer. Data used to construct and evaluate the functional 
signatures are available at ftp://stat.duke.edu/pub/Users/ 
iversen/FunctionalSignatures/. 



Results and discussion 

In association studies statistical analysis is utilized to 
measure the evidence in favor of association. The data 
that inform these analyses usually comprise phenotype 
labels, SNP genotype data and a set of non-genetic covari- 
ates in addition to functional annotations of the vari- 
ants under study. The statistical analysis may take many 
forms, varying according to choice of modeling approach 
and inferential paradigm (Frequentist or Bayesian). The 
approach we develop here relies on Bayesian inference 
but can also be applied when the genetic association 
summaries are derived from a Frequentist analysis. In 
this paradigm, prior data on a quantity of interest (such 
as the binary association status of a genetic variant) 
are updated to reflect evidence in the current data 
set. 

A Bayesian analysis of genetic association data returns 
an estimate of the posterior odds of association of each 
marker given the available data. When the data take two 
distinct forms — here subject-level phenotype, genotype 
and covariate data and variant-level functional annota- 
tions — the posterior odds of association may be cal- 
culated in two stages, either by incorporating functional 
data prior to or following evaluation of the genetic data. 
The latter represents the heuristic typically followed in 
practice, whereby functional data is evaluated in an infor- 
mal way (from the probabilistic point of view) conditional 
on evidence for association. Here we describe a model- 
based framework for combining functional and associa- 
tion data following the second factorization. We focus on 
the case-control study design for purposes of illustrating 
integration of the a priori (to association data) models for 
functional annotation data we describe below into analy- 
ses of genetic association data. Details of the models and 
their assumptions are provided in Methods. 

When the functional data are incorporated as prior 
information, the posterior odds of a SNP's association 
given the functional and subject-level data can be written 
as the product of the Bayes factor (BF) in favor of associa- 
tion and the prior odds of association given the functional 
data. The BF is the ratio of the integrated likelihood of 
the phenotype data given the covariate and genotype data 
assuming the SNP is associated to the integrated likeli- 
hood of the phenotype data given the covariate data only 
(i.e. assuming the SNP is not associated). It is a commonly 
used Bayesian statistical measure of association and is cal- 
culated by the SNPTEST [15] and BIMBAM [16] packages 
for analysis of GWAS data. Alternately, it can be approx- 
imated by the ABF [17,18], allowing the method to be 
used in conjunction with standard Frequentist association 
testing software. 

In short, the functional annotation data are incorpo- 
rated into an analysis by formally updating the prior odds 
of association given the annotation data by a standard 
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measure of genetic association. This process is depicted 
schematically in Figure 1. In what follows, we describe 
the model used to calculate prior odds of association 
and demonstrate its use in a GWAS of ovarian cancer. 
In it, the log of a SNP's prior odds of association equals 
a + Ffi, where F denotes its functional data and a and 
P are parameters. We refer to the linear combination F/3 
associated with a SNP as its 'functional signature'. 

Functional signatures of known associations 

We constructed the functional annotation signatures by 
estimating the multivariate relationship between a set of 
functional annotation variables and the binary association 
status of a set of SNPs. In this way, we identify a linear 
combination, fiF, of the functional annotations, denoted 
F. We refer to this linear combination as a SNP's 'func- 
tional signature' when it is evaluated using that SNP's 
annotations, F s . Figure 2 provides a schematic of our 
approach. In brief, we identified a set of associated SNPs 
and, for each, we chose a matching, unassociated 'control' 
SNP. We divided the matched pairs into 'training' and 
'validation' sets and used the former to construct a series 
of models to predict association status given the func- 
tion data and used the latter to compare the performance 
of these models. We chose the model that demonstrated 
the best predictive accuracy in the validation data, as 



measured by concordance index, to define the functional 
annotation signatures. 

We began by constructing a matched case-control study 
of SNPs in which the cases were drawn from the GWAS 
Catalog [8] and the controls were identified from the 
HapMap database, Release 27, Phases II and III merged 
genotypes. We identified 2,093 case SNPs and, for each, 
identified one control SNP matched on chromosome, 
minor allele frequency and the genotyping platform(s) 
it appeared on. Since SNPs in the GWAS Catalog are 
arguably more frequently tags than the directly associ- 
ated variant, we identified 'LD partners' [8] for each case 
and control SNP. We grouped each case and control SNP 
together with its LD partners to form blocks. 

Using on-line bioinformatics resources, we assembled 
a set of functional annotation variables representing a 
variety of contextual descriptions and empirical mea- 
surements with which we annotated each of the 48,889 
case, control and LD partner SNPs. We included anno- 
tation variables shown to be correlated with presence 
in the GWAS Catalog or that we believed likely to be 
so. These were: dbSNP function designation; summaries 
of ENCODE Project [19,20] data on transcription levels 
assayed by RNA-seq [21,22], measures of signal enrich- 
ment for H3K4Mel, H3K27Ac and H3K4Me3 histone 
modifications associated with enhancer and promoter 
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Figure 1 Two-staged procedure for integrating variant-level functional annotation data with subject-level genetic association data. At 

the first stage, functional annotation data are combined to estimate the prior (to observing the genetic association data) probability of association 
for each variant. At stage two, these estimates are combined with the Bayes Factor (a metric of association) in favor of genetic association via Bayes' 
formula to estimate the posterior (to observing the functional and genetic association data) probability of association for each variant. 



Iversen etal. BMC Genomics 2014, 15:398 
http://www.biomedcentral.eom/1 471 -2 1 64/1 5/398 



Page 4 of 16 




Prior Prediction 

SNP case 

1. rsOOOll 1 

2. rs00012 0 

836. rs00016 



|^^^^ | Conco r dance j 



Figure 2 Construction and evaluation of models for (prior) probability of association given the functional annotation data. The purple 
arrows represent model construction ('training'), while the green arrows represent evaluation of the models. Construction of the training set, 
validation set and functional annotation database are depicted in Additional file 1 : Figure S1 0 and described in Methods. The training data were 
used to construct a series of models, each distinguished by the coefficients (or 'weights') it assigns to the various annotation variables. We chose the 
best among these by comparing their predictions in the validation set using the concordance index. 



activity [23,24], evidence for overlap with a DNasel hyper- 
sensitivity cluster [25,26] and evidence for transcription 
factor binding [27-31]; PhyloP evolutionary conservation 
scores [32]; indicators for whether or not the variant falls 
in a region of known copy number variation, a region 
containing insertions or deletions or a region with inver- 
sions [33,34]; PolyPhen-2 [35] probability that a mutation 
is damaging; and RegulomeDB score [36]. We used prin- 
cipal components analysis to reduce 75 Broad ChlP-seq, 
24 Caltech RNA-seq, and 24 PhyloP ENCODE track sum- 
mary statistics to the 18, 11 and 4 principal components, 
respectively, that explained at least 90% of variation in the 
class. 

While not a comprehensive set, the functional vari- 
ables we chose to work with covered the major annotation 
classes available at the time of analysis and are readily 
available to individuals executing an association study. 
The infrastructure and methods described here are easily 
updated to accommodate new variables as they become 
generally available. Table 1 lists the 57 variables that we 
used to construct the functional signatures of association. 

The 48,889 SNPs included in the analysis were grouped 
into 2,093 case and an equal number of control blocks. We 
randomly selected 1,675 of these matched case-control 
pairs for development of the model (the 'training set') 
and left the remaining 418 pairs for model evaluation 
and comparison (the 'evaluation set'). We modeled the 



probability that a SNP is associated given its functional 
data using a logistic regression model in which the asso- 
ciation status of the individual SNPs in the case blocks 
is assumed unknown. The model assumes, however, that 
each case block contains at least one truly associated SNP 
and that each control block contains none; as in [7,14], 
the identities of the associated SNPs are not specified as 
they are unknown (details of the model are provided in 
Section "Model"); a prior distribution on the overall frac- 
tion of associated SNPs was employed to encourage their 
number to be close to one on average. In this way, the 
model is free to identify multivariate patterns in the anno- 
tation data ('signatures') of individual SNPs that would be 
diluted by averaging the annotation measurements of all 
SNPs within a block. 

While the assembled list of functional predictors 
includes measures that have been demonstrated to cor- 
relate with the association status of SNPs in the GC, it 
also includes a number of measures whose utility in this 
regard was unclear. Hence, we expected that only a frac- 
tion of the 57 variables would contribute to predicting 
phenotype association. We used shrinkage priors [37,38] 
to reflect this belief and chose the normal-exponential- 
gamma (NEG) distribution for its ability to penalize 
heavily weakly determined predictors and to penalize 
weakly those that are well determined [39-41]. Further 
details of the model and the Markov chain Monte Carlo 



Iversen etal. BMC Genomics 2014, 15:398 
http://www.biomedcentral.com/1471-2164/15/398 



Page 5 of 16 



Table 1 Annotations used to construct the functional signatures 


Name 


Annotation class 


Description 


MAF/..4 


Minor Allele Frequency 


Natural spline basis for MAF 


funclntron 


dbSNP Function Class 


Indicator that variant is intronic. 


funcNg3 


dbSNP Function Class 


Indicator that variant is near-gene-3. 


funcNgb 


dbSNP Function Class 


Indicator that variant is near-gene-5. 


funcNonsynon 


dbSNP Function Class 


Indicator that variant is missense or nonsense. 


funcSynon 


dbSNP Function Class 


Indicator that variant is synonymous. 


funcUTR 


dbSNP Function Class 


Indicator that variant is in the 3' or 5' UTR. 


PhyPCL4 


phyloP Evol. Cons. Score 


First 4 PCs for PhyloP data. 


Indellnd 


DGV Regions 


Indicator that SNP is in the region of a known in-del. 


CNVInd 


DGV Regions 


Indicator that SNP is in the region of a known CNV. 


Invlnd 


DGV Regions 


Indicator that SNP is in the region of a known inversion. 


BrPCf../8 


ENCODE Super-Track 


PCs of Broad promoter/enhancer ChlP-seq data. 


CalPC/..// 


ENCODE Super-Track 


PCs of CalTech transcription level RNA-seq data. 


logDNase 


ENCODE Regulatory Super-Track 


DNasel hypersensitivity cluster log(score). 


TFBSfreq 


ENCODE Regulatory Super-Track 


SNP in ChlP-seq TFBS region(s) - count. 


logTFBS 


ENCODE Regulatory Super-Track 


SNP in ChlP-seq TFBS region(s) - log(TFBS score). 


OReglnd 


Open REGulatory ANNOtation DB 


Indicator that SNP is in ORegAnno DB. 


PPh2Prob 


PolyPhen-2 


Probability that SNP is damaging. 


RegDBcat 


RegulomeDB 


RegulomeDB category. 



Definitions of the 54 variables appearing in the prior model for association status arranged by type/class of annotation. 



(MCMC) algorithm used for inference can be found in 
Methods. 

Table 2 provides a summary of the coefficient esti- 
mates obtained for the binary regression of association 
status on the 57 functional annotation variables. Because 
all variables in the model were standardized, coefficients 
measure the difference in the log-odds of phenotype asso- 
ciation attributed to an increase of one standard deviation 
in the covariate when the others remain fixed. The major- 
ity of variation (51%) in the functional scores as measured 
in the control block SNPs from the validation set, is due to 
the Broad promoter/enhancer ChlP-seq principal com- 
ponents (PCs) and nearly all (> 97%) of this variation is 
due to PCs 1, 2, 4, 5, 6, 8 and 13. Each PC is a linear 
combination of the 75 summary statistics of the 25 assays. 
Additional file 1: Figure S3 depicts the loadings (weights 
in the linear combinations) for these PCs as they depend 
on histone modification, cell line and summary statis- 
tic. Grossly, PC 1 measures total signal strength across 
all cell lines and histone modifications, PC 2 contrasts 
average signal strength of the H3k4me3 assay against 
average signal strength of the remaining assays, while 
the remaining PCs each contrast signal in one subset of 
cell lines with that in another (PC 4: HMEC and NHEK 
versus GM12878 and K562; PC 5: GM12878, HMEC 
and NHEK versus HSMM, HUVEC and NHLF; PC 6: 



Hl-hESC, HepG2 and HSMM versus GM12878 and 
HUVEC; PC 8: K562 versus Hl-hESC; and PC 13: HepG2 
versus Hl-hESC). Additional file 1: Figure SI and S2 
depict the correlation matrix and a hierarchical clus- 
tering of the 75 summary statistics, respectively. For a 
given combination of cell line and histone modifica- 
tion, the signal strength ('log(mean)') and peak signal 
('log(maximum)') measures tend to be highly correlated 
with one another but tend to have only modest correlation 
with the signal consistency measure ('log(z)'); patterns of 
correlation across cell lines and histone modifications are 
more complex. 

The sequence conservation PCs collectively make the 
next largest contribution, explaining 16% of variation 
in the functional scores; PCs 1 and 3 explain > 97% 
of this. Each PC is a linear combination of the sum- 
mary statistics of the 28 and 44 species PhyloP scores, 
each for all species and restricted to placental mam- 
mals. Additional file 1: Figure S6 graphs the loadings 
for these PCs as they depend on number of species, 
depth of alignment and summary statistic; Additional 
file 1: Figure S4 and S5 depict the correlation matrix 
and a hierarchical clustering of these variables, respec- 
tively. Briefly, PC 1 measures total signal strength across 
scores with the scores based on the 28-way alignment 
weighted more heavily than those based on the 44-way 
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Table 2 Summary of estimates for the model for association status given the functional annotation data 



Coefficient 


Mean 


SD 


Mean/SD 


Coefficient 


Mean 


SD 


Mean/SD 


MAF1 


0.029 


0.0272 


1.051 


CalPC8 


0.096 


0.0608 


1.584 


MAF2 


0.003 


0.0185 


0.151 


CalPC9 


0.009 


0.0218 


0.406 


MAF3 


0.018 


0.0236 


0.759 


CalPCI 0 


-0.019 


0.0234 


-0.809 


MAF4 


-0.008 


0.0199 


-0.425 


CalPC1 1 


-0.044 


0.0468 


-0.943 


BrPCI 


-0.348 


0.0388 


-8.983 


PhyPCI 


0.225 


0.0452 


4.982 


BrPC2 


0.174 


0.0360 


4.845 


PhyPC2 


-0.023 


0.0308 


-0.742 


BrPC3 


0.002 


0.0172 


0.099 


PhyPC3 


0.053 


0.0395 


1.336 


BrPC4 


-0.077 


0.0301 


-2.561 


PhyPC4 


0.024 


0.0289 


0.839 


BrPC5 


0.149 


0.0301 


4.932 


funclntron 


-0.003 


0.0199 


-0.160 


BrPC6 


0.097 


0.0329 


2.961 


funcNg3 


0.002 


0.0238 


0.098 


BrPC7 


-0.019 


0.0229 


-0.825 


funcNg5 


0.003 


0.0200 


0.158 


BrPC8 


-0.078 


0.0312 


-2.498 


funcNonsynon 


0.089 


0.0387 


2.308 


BrPC9 


-0.012 


0.0202 


-0.573 


funcSynon 


-0.007 


0.0236 


-0.283 


BrPCIO 


0.007 


0.0182 


0.407 


funcUTR 


0.002 


0.0210 


0.078 


BrPC1 1 


0.021 


0.0239 


0.887 


logDNase 


0.011 


0.0253 


0.419 


BrPC12 


-0.039 


0.0290 


-1.343 


TFBSfreq 


0.024 


0.0267 


0.885 


BrPC13 


-0.094 


0.0318 


-2.972 


logTFBS 


0.019 


0.0299 


0.641 


BrPC14 


-0.000 


0.0175 


-0.015 


OReglnd 


0.027 


0.0236 


1.163 


BrPC15 


-0.039 


0.0286 


-1.354 


hdellnd 


-0.023 


0.0337 


-0.693 


BrPC16 


0.009 


0.0185 


0.467 


CNVInd 


0.059 


0.0321 


1.842 


BrPCI 7 


-0.015 


0.0213 


-0.696 


nvlnd 


0.090 


0.0305 


2.938 


BrPC18 


-0.014 


0.0210 


-0.688 


rDBcatl 


0.014 


0.0257 


0.550 


CalPCI 


-0.103 


0.0492 


-2.084 


rDBcat2 


0.066 


0.0378 


1.741 


CalPC2 


0.090 


0.0441 


2.030 


rDBcat3 


0.012 


0.0264 


0.467 


CalPC3 


-0.019 


0.0270 


-0.709 


rDBcat4 


0.116 


0.0461 


2.508 


CalPC4 


0.086 


0.0457 


1.889 


rDBcat5 


0.106 


0.0527 


2.003 


CalPC5 


0.053 


0.0395 


1.350 


rDBcat6 


-0.056 


0.0552 


-1.018 


CalPC6 


-0.012 


0.0233 


-0.496 


pph2prob 


0.078 


0.0269 


2.905 


CalPC7 


0.002 


0.0207 


0.078 











Estimates of the posterior mean and standard deviation are provided for each coefficient in the model along with the ratio of these quantities, a 'signal-to-noise' 
measure analogous to the Z statistic. 



alignment, while PC 3 contrasts the 28-way with 44-way 
scores. 

The CalTech RNA-seq PCs collectively explain 10% of 
the signature, with PCs 1, 2, 4 and 8 contributing 87% 
of this. Additional file 1: Figure S9 depicts the loadings 
for these PCs as they depend on cell line and summary 
statistic. PC 1 provides a measure of total signal strength 
across all cell lines, while the remaining PCs each con- 
trast signal in one subset of cell lines with that in another 
(PC 2: Hl-hESC and K562 versus GM12878 and NHEK; 
PC 4: GM12878 and Hl-hESC versus K562, NHEK and 
HepG2; PC8: HUVEC versus NHEK). Additional file 1: 
Figure S7 and S8 depict the correlation matrix and a hier- 
archical clustering of the RNA-seq summary statistics, 
respectively. These reveal the largely complementary 



information provided by each of the cell lines as well as 
high correlations between the signal strength, peak signal 
and signal indicator measures of a given cell line. Here, as 
in the Broad ChlP-seq data, the signal consistency mea- 
sure is only modestly correlated with the remaining signal 
measures. 

RegulomeDB categorizes variants into a seven-level 
functional score based on a synthesis of regulatory data 
derived from ENCODE and other sources. Category 7 
variants lack evidence of association; categories 4-6 show 
'minimal binding evidence'; categories 2 and 3 are likely' 
and less likely to affect binding', respectively; and cat- 
egory 1 variants are those likely to affect binding and 
linked to expression of a gene target'. Its score explains 
the next largest fraction (8%) of variation. It is represented 
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by six variables, each indicating a functional category; 
category 7 serves as the reference ('baseline'). Categories 
2, 4, 5 and 6 explain 99% of this variation suggesting that 
other annotation variables in the model better charac- 
terize the probability of phenotype association for vari- 
ants in categories 1 and 3. This is especially notable 
for category 1 as it is the only variable included in 
the model explicitly tied to eQTL status. This failure 
of simple eQTL summaries to provide predictive power 
once the other annotations are included in the model is 
consistent with our findings from a preliminary analy- 
sis that included lymphoblastoid line-based eQTL sum- 
maries derived from the SCAN database (http : / / www . 
scandb . org; [10]), but not the RegulomeDB variables. 
We found no practical or statistically detectable differ- 
ence between the accuracy, as measured by concordance 
index (see below and Methods), of out-of-sample pre- 
dictions made using the model including all annotations 
and the model including all annotations except the eQTL 
variables. 

Virtually all (96%) of the 2.5% contribution to variation 
made by the DGV variables is due to the copy number and 
inversion variables. Finally, at 1.7%, the dbSNP functional 
class variables are the only remaining that contribute more 
than 1% of the variation in functional scores. Virtually all 
(99%) of this contribution is due to the non-synonymous 
designation and the PolyPhen-2 probability contributes 
to the model's ability to predict within this class of vari- 
ants. This suggests that more direct measures of a variant's 
involvement in transcriptional regulation, such as location 
in or near a DNAse I hypersensitive site (DHS), are bet- 
ter predictors of association than proximity to a nearby 
transcription start site (TSS) or other gene-centric land- 
mark. Indeed, it has been shown [42] that GWAS vari- 
ants are concentrated in DHSs, features captured by the 
Broad promoter/enhancer ChlP-seq principal compo- 
nents and the transcription factor ChlP-seq data. These 
are more flexible measures that allow the model to dis- 
tinguish between promoters and more distant enhancers, 
inactive versus active promoters and local versus more 
distant enhancer-TSS interactions, each representing 
an important distinction [43,44] relevant to predicting 
function. 

We estimated the concordance indexes (equivalent to 
AUC, area under the ROC curve) for each model using 
the 418 matched case-control block pairs in the validation 
set as a tool for comparing the accuracy of their out- 
of-sample predictions. Table 3 provides the estimates of 
concordance and associated 95% interval estimates. While 
the concordance statistics are not discernibly different 
from one another, the best out-of-sample predictive abil- 
ity is achieved using the model with the prior distribution 
having the strongest shrinkage properties, i.e. the 'NEG3' 
model. 



Table 3 Means and 95% interval estimates of the 
concordance indices for each of the four models 



Concordance index 



Label 


Prior 


Mean 


95% CI 


Normal 


N(0, 1) 


0.6348 


(0.6112, 0.6555) 


NEG1 


NEG(0.834, 0.1610) 


0.6397 


(0.6148, 0.6615) 


NEG2 


NEG(0.950, 0.0588) 


0.6433 


(0.6208, 0.6675) 


NEG3 


NEG(0.978, 0.0245) 


0.6487 


(0.6244, 0.6675) 



Application to an ovarian cancer multi-GWAS study 

We compared the ranks assigned to a group of variants 
in a GWAS analysis when those ranks are calculated 
with and without the functional annotation data. Each 
in the group of variants is assumed to have known asso- 
ciation status (associated/unassociated) with epithelial 
ovarian cancer, where this determination is based on 
confirmatory studies subsequent to the GWAS. The 
group is constructed as follows. There are currently 
11 published, genome-wide significant loci for epithe- 
lial ovarian cancer. Nine of the 11 have come to light 
through analysis of genome-wide SNP data. These 
are rs3814113 [45], rs8170 [46], rs2072590, rs2665390, 
rs7814937, rs9303542 [47], rsll782652, rs7084454, 
rs757210 [48]. The remaining two (rsl0069690 and 
rs2077606) were identified by candidate gene/pathway 
investigations [49,50]; all 11 have been evaluated in very 
large confirmatory studies. We consider these to be 'true 
positive' variants. Our analysis of data from the large- 
scale follow-up study of GWAS candidates [48] allowed 
us to identify a group of 5,155 variants with 'strong evi- 
dence' (BF>10) in favor of the null model, i.e. the model 
of no association, based on Jeffreys' scale of evidence [51] 
that we treat here as 'true negatives'. 

Table 4 summarizes the GWAS results for the true 
positive and true negative SNPs when the analysis is con- 
ducted with (subscript A+F') and without (subscript A') 
the functional signatures. Note that, among the true pos- 
itives, the SNPs discovered through candidate studies 
are ranked substantially lower than those identified via 
GWAS. Indeed, the evidence in the association data for 
these variants is actually against association (both of their 
Bayes factors are less than 1.0). The GWAS SNPs are all 
ranked in the top 50,000 (of approximately 2.5 million) by 
the same measure and all have Bayes factors of at least 3 
to 1 in favor of association. 

Only two of the truly associated SNPs (rsll782652 
and rs9303542) are ranked higher when the functional 
data are ignored than when they are used, however their 
respective changes in rank are small. The median (alt. 
average) rank of the truly associated SNPs was 5,272 
(178,246) without and 3,532 (80,143) with the functional 
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Table 4 Functional signatures improve inference for association status in a GWAS of ovarian cancer 



Variant 


Locus 


MAF 


LO F 


log(BF fl ) 


Rankfl 


Rank A+F 


rs2072590 


2q31 


0.34 


1.46 


8.63 


65 


59 


rs2665390 


3q25 


0.09 


0.77 


8.08 


77 


73 


rs 10069690 


5p15 


0.23 


0.91 


-1.38 


1,549,122 


651,710 


rsl 1 782652 


8q21 


0.08 


0.22 


2.98 


5,272 


6,843 


rs7814937 


8q24 


0.12 


1.54 


14.61 


21 


16 


rs3814113 


9p22 


0.30 


-0.09 


14.01 


38 


38 


rs7084454 


10p12 


0.31 


1.44 


1.19 


45,616 


12,221 


rs757210 


17q12 


0.37 


1.74 


2.31 


11,630 


2,411 


rs2077606 


17q21 


0.18 


0.70 


-0.25 


339,456 


200,494 


rs9303542 


17q21 


0.27 


0.05 


3.70 


2,276 


3,532 


rs81 70 


19p13 


0.19 


0.82 


2.72 


7,133 


4,179 


Mean 


True + 


0.23 


0.87 


5.15 


1 78,246 


80,143 


Median 


True + 


0.23 


0.82 


2.98 


5,272 


3,532 


Mean 


True - 


0.35 


0.11 


0.37 


438,664 


517,810 


Median 


True - 


0.36 


0.06 


0.14 


181,116 


244,393 



Ranks of known associated variants (labeled 'true +') tend to improve (i.e. are closer to one) when association and functional data are incorporated in the analysis 
(RankA+F) relative to when only the association data are used (RankA) and, hence, are more likely to be studied further. Conversely, ranks of (very likely) unassociated 
variants (labeled 'true — ') tend to fall with inclusion of the functional data. The functional data for a given variant is summarized by its 'functional signature', defined as 
the prior log-odds of its association given the functional data (LOf). Aggregate (mean and median) values are provided for the true + set and the true - set. Ranks are 
out of approximately 2.5M variants. 



data included. If design constraints allowed only for fol- 
lowup of the top 5,000 variants, a larger fraction (7/11) 
would be discovered with addition of the functional data 
than without (5/11); with followup of 10,000 variants, 
these fractions become 8/11 and 7/11. In contrast, when 
the function data were included, the median (alt. aver- 
age) rank among a set of 'true negative' SNPs increased 
from 181,116 (438,664) to 244,393 (517,810), while the 
number selected for followup fell from 244 to 204 under 
the 5K scenario and from 443 to 373 under the 10K 
scenario. 

Functional signatures of tag SNPs correlate with function 
of tagged SNPs 

While a few of the functional variables, such as the func- 
tion class designation 'nonsynonymous', incorporated in 
the signature are base pair specific, most map to con- 
tiguous regions of 100's or 1000's of base pairs. Hence, 
the functional signatures associated with nearby SNPs are 
correlated. Figure 3 is a plot of the correlation between the 
functional signatures of adjacent SNPs that passed QC in 
the ovarian cancer GWAS described above as a function 
of the distance, measured in base pairs (BPs), between the 
two variants. This correlation is greater than 0.72 (alt 0.68) 
for more than 80% (alt 97.5%) of adjacent variants, corre- 
sponding to those at distances of 1470 (alt 4376) BPs or 
less. Hence, while there are gains to be realized in doing so, 
it is not necessary to impute to and annotate at the highest 
possible density to realize an increase in power to detect 



association through the use of functional signatures, a 
fact we demonstrated empirically above. Note that typical 
BP distances between tagged (not genotyped or imputed) 
variants and their nearest tag will be on the order of 
one half of the distances reported here for adjacent 
tags. 

Conclusions 

Using the GWAS Catalog as a sampling frame, we 
developed a model for the probability that a given poly- 
morphism is associated with an observable human phe- 
notype given a set of functional annotation variables and 
demonstrated that this model has the ability to predict 
a set of phenotype associated variants not used in the 
model building exercise. We demonstrate several meth- 
ods for incorporating functional annotation signatures 
defined by this model and evaluated for a SNP's annota- 
tion data as prior data and show through example that by 
doing so we improve the efficiency of GWAS scale anal- 
ysis to identify true positive associations for follow-up 
study. 

The approach we describe is computationally tractable 
and scalable to modern genome-wide analysis. Our use 
of penalized regression techniques to model the func- 
tional data and construct the function signatures allows 
us to consider a relatively large number of individual 
annotation variables while controlling for over-fitting. 
We evaluated sensitivity of the model's out-of-sample 
predictions to choice of shrinkage prior and found that 
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Figure 3 Correlation of functional signatures between adjacent HapMap \\/\\\ SNPs as a function of base pair distance (black line). 

Cumulative distribution function(CDF) of base pair distances across the genome (red line). 



the most aggressive choice we examined, the model 
whose results are summarized herein, resulted in the best 
out-of-sample concordance estimates. Our approach can 
be expanded and adapted to incorporate more detailed 
annotation data such as was recently released by the 
ENCODE consortium [11] or generated experimentally in 
individual labs. 

In principle, estimates of the parameters in the model 
for SNP association status given the functional data can 
be refined via Bayesian updating as part of an asso- 
ciation analysis. This requires an additional layer of 
analysis that is feasible, but computationally demand- 
ing to implement on a genome-wide scale. However, 
the value of this will be limited in settings where 
there are few truly associated SNPs and/or the case- 
control data supporting associations are weak, i.e. the 
vast majority of applications. Here, Bayesian updat- 
ing will yield estimates equivalent to those using the 
approach we describe above up to Monte Carlo simulation 
error. Indeed, we formally compared the two approaches 
using the ovarian cancer GWAS data and found little 
change in the median ranks of the true positive (3,532 
versus 3,705) and true negative SNPs (244,393 versus 
248,459). This suggests that the added value of Bayesian 
updating to the functional signatures will typically be 
limited. 

Performance for our integrative approach likely depends 
on the depth, specificity and density of coverage of the 



available annotation data. The current study defines a 
starting point and benchmark in each of these dimensions. 
In particular, while the depth of annotation considered 
here is sufficient to noticeably improve inference for 
association, it is clear from recent ENCODE Project 
Consortium publications that it reflects only a small 
fraction of the complexity present in the regulatory land- 
scape. Further, none of the annotation variables are tai- 
lored to the outcome phenotype; indeed, the ENCODE 
super-track data enter the model through linear combi- 
nations of the cell-line specific measurements, effectively 
averaging over cell type. In addition, the associations we 
observe in our study of the GWAS Catalog SNPs are 
those that tend to identify regulatory functions impor- 
tant to a range of phenotypes (i.e. those represented in 
the GWAS Catalog). This may explain, at least in part, the 
relatively modest nature of the improvements observed 
in our GWAS analysis and the failure of general sum- 
maries of eQTL status to contribute meaningfully to the 
functional signatures. Indeed, many regulatory processes 
are cell-type-specific [11,12] and hence will be more 
informative for a given phenotype if measured in the 
appropriate context [14]. However, determining the rel- 
evant annotation data, assuming it exists, for a given 
phenotype requires domain expertise and more careful 
modeling to create functional signatures. While Bayesian 
updating did not improve inferences in the ovarian can- 
cer GWAS example using the existing model, it may when 
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the model is generalized to incorporate disease-specific 
annotations. 

Finally, our analyses have been carried out entirely 
at the HapMap III density. Our approach succeeds 
at this density because the functional signatures of 
SNPs nearby, at distances typical of HapMap III, are 
highly correlated and hence the functional signatures of 
HapMap III polymorphisms essentially tag function of 
nearby polymorphisms not in the database. As cover- 
age (genotype/imputation density) of the typical associ- 
ation study becomes more complete, the need to rely 
on correlations between functional signatures will dimin- 
ish and their power to assist in identifying and local- 
izing associations is expected to increase. Association 
analyses at the density of the 1000 Genomes Project 
database [52] are now possible and will likely become 
common. The specificity of the functional signatures 
should improve when reconstructed and applied at this 
density as we plan to do as we continue to develop this 
approach. 

Methods 

Association analysis given annotation data 

Let G be an n by p matrix of SNP genotypes, D be an n 
by 1 vector of disease indicators where Z); = 1 if individ- 
ual i has the disease and Di = 0 otherwise, X be an n by 
r matrix of covariates used in the association model and 
F be a p by m matrix of SNP-level functional annotation 
data where n is the number of individuals, p is the number 
of SNPs, r is the number of covariates and m is the num- 
ber of annotation variables. Finally, \&XA be a p by 1 vector 
of 0-1 indicators of the (unknown) association status of 
the variants, where A s = 1 if SNP s is associated with the 
phenotype of interest. 

In what follows, we specify the likelihood for the associ- 
ation indicator given the association (X, D, G) and func- 
tion (F) data. To this end, we let Pr(A|AXG,F) oc 
nf=i P r (^s I A X, G s , F s ). This relies on two assumptions: 
(1) that the A s 's are conditionally independent given 
(X, D, G, F) and (2) that the A s 's are conditionally inde- 
pendent of other variants (G_ s , F_ s ) given (X, D, G s , F s ). 
The notation G_ s indicates the matrix obtained by remov- 
ing column s from G. 

Further, we assume that the disease phenotype data are 
conditionally independent of the functional data for SNP 
s given its association status, covariate and genotype data 
and that its association status is conditionally independent 
of the covariate and genotype data given its functional 
data. The latter assumption may be violated, for example, 
if the genotype data G s carries information about function 
(e.g. minor allele frequency) not included in F. Given this, 
the posterior odds of association of SNP s given its asso- 
ciation and functional data can be written as the product 
of the (prior) odds of its association given its functional 



data times the (integrated) likelihood ratio or Bayes Factor 
(BF) of the phenotype given the SNP genotype and other 
covariate data, i.e. 

aaia unvr-^ Pr( ^ = l\D,X,G s ,F s ) 

oddsG4 s = 1|AX, G S ,F S ) = — 

Pr(.4 s = 0\D, X,G S ,F S ) 

= MD\A S = 1,X,G S ) Pr(A s = l\F s ) 

Pr(D\A s =0,X,G s ) Pr(A s = 0\F s ) 

= BF S x odds(A s = l\F s ), 

We describe estimation of the association summary Bayes 
factor below. 

Given the binary, logistic link model developed below 
for association status given the functional data and the 
parameters a and fi, odds(^4 s |F s ) = exp (a + F s fi) and 
hence, given a and fj 



Vr{A s = l\D,X,G s ,F s ,a,p) 



BF s exp(o; + F s fJ) 
1 + BF s exp(« + F s p)' 
(1) 



where F s fi is the 'functional signature' of SNP s. Provided 
that estimates of a and fi are available from an external 
analysis such as described in the next section, one can 
estimate Vr(A s = 1 \D, X, G s , F s ) by 



J]Pr(A s = l\D,X,G s ,F s ,a i ,fi i )/I 



where the or* and fit are samples from the poste- 
rior distribution from an analysis such as described in 
Section "Functional signatures of known associations". 

The above procedure depends on estimates of the 
marginal likelihoods, 



Pr(D\X,G s ,A s = a) 



Pt(D\X,G s ,A s = a,e a )MO a ), 



of the association data for each SNP under H Q (A s = 0) 
and under H a (A s = 1). Pr(£> | X, G S ,A S = a, 9 a ) is a logis- 
tic regression of the disease status indicator, D, on the 
covariates, X, and SNP genotype, G s , and with coefficient 
vector 0\ under H a and is a logistic regression D on X with 
coefficient vector 6q under H Q . We place independent nor- 
mal mean 0, standard deviation 10 prior distributions on 
all components of 9q and 9\, with exception of the coeffi- 
cient of G s , which is accorded a normal mean 0, standard 
deviation 0.25 prior distribution, as the majority of log- 
odds estimates cited in the GWAS catalog are smaller 
than 0.5 in absolute value. We estimate the SNP-specific 
marginal likelihoods under each hypothesis of associa- 
tion using the Laplace approximation [53] implemented in 
software described in Wilson et al. (2010) [54] and avail- 
able from the authors. In cases where it is not convenient 
or possible to directly calculate Bayes factors, they can be 
approximated by the ABF [17,18], allowing the method 
to be used in conjunction with Frequentist association 
testing software. 
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Construction of functional signatures 

In what follows, we detail the steps we took to assemble 
the case-control study of SNPs used to build and evalu- 
ate the models for a variant's association status given the 
functional data. These comprised identification of a rep- 
resentative set of phenotype-associated SNPs to serve as 
'cases' in the analysis and a matched set of 'controls' and 
the collection of a set of measurements related to func- 
tion to serve as annotations for the variants. The process 
is depicted in Additional file 1: Figure S10 and described 
in detail below. 

Sampling frame 

Many genomewide genotyping arrays were designed to 
over-sample variants with characteristics related to their 
ability to explain phenotypic variation, such as prox- 
imity to coding regions, type of variant (e.g. missense) 
and minor allele frequency (MAF). Hence, a compari- 
son of case SNPs identified using such assays to con- 
trol SNPs drawn randomly from the HapMap or dbSNP, 
for example, may lead to spurious associations between 
assay design variables and the SNP's association with a 
human phenotype. In order to avoid confounding due 
to the selection method employed in the design of the 
genomewide genotyping platform, we constructed a sam- 
pling frame of SNPs by combining SNPs on the Affymetrix 
GeneChip Human Mapping 500K Array Set and the 
Illumina HumanHap550 Genotyping BeadChip, as this 
generation of arrays and their predecessors cover most 
of the reported findings in the GWAS Catalog attributed 
to an Affymetrix or Illumina product and these products 
were the most commonly used. We labeled SNPs in the 
sampling frame according to whether they appeared only 
on the Affymetrix list, only on the Illumina list or on both 
and confined attention to those variants appearing in both 
the Genome Browser's dbSNP 130 and HapMap Release 
27 tables (see Additional file 1: Table SI) and having a 
MAF estimated in HapMap's CEU sample to be 0.05 or 
larger. The sampling frame comprises 803,991 SNPs with 
421,072 unique to Illumina, 305,672 to Affymetrix and the 
remaining 77,247 common to both. 

Case and control selection 

The GWAS Catalog is subject to constant update and 
versions are available from several locations. We down- 
loaded the GWAS Catalog from the Genome Browser 
(time stamp and location in Additional file 1: Table 
SI). We confined attention to non-CNV variants in the 
GWAS Catalog discovered by association studies utiliz- 
ing an Affymetrix and/or an Illumina genomewide array 
and present in the sampling frame. We randomly chose 
a single representative of each set of SNPs appearing 
multiple times in the GWAS Catalog or sharing one or 
more 'LD partners' (see below). This left 2093 unique 



case SNPs, 1306 of which were unique to Illumina, 403 
unique to Affymetrix and 384 in common. We randomly 
matched one control SNP drawn from the sampling frame 
to each case SNP on chromosome, platform (Illumina 
only, Affymetrix only, on both) and MAF rounded to the 
nearest 0.02. We excluded SNPs in the sampling frame in 
LD (R 2 > 0) with one or more case SNPs as reported in 
the HapMap Release #27 LD files (see Additional file 1: 
Table SI) or sharing an LD partner with another control 
SNP. 

LD Partner identification 

SNPs in the GWAS Catalog are arguably more likely to 
tag the variant that is directly associated with the phe- 
notype than to be that variant [8]. For this reason, we 
identified and annotated each case and control SNP's 'LD 
partners' [8]. We defined LD partners as those SNPs with 
R 2 > 0.8 with a case or control SNP as reported in the 
HapMap Release #27 LD files. Hindorff et al. (2009) [8] 
chose a threshold of 0.9 but noted that their results were 
nearly the same when using thresholds of 1.0 and 0.8. We 
identified 20,924 LD partners of the case SNPs and 23,779 
LD partners of the control SNPs. 

Annotation data 

All data drawn from the UCSC Genome Browser [55] used 
the "March 2006 (NCBI36/hgl8)" assembly. Additional 
file 1: Table SI provides locations, revision dates and ref- 
erences for each of the annotation files referred to below. 
In what follows, we describe each class of annotation vari- 
able, its source and the parameterization we use for it in 
the models we fit. 

Variants described in dbSNP [56] release 130 are classi- 
fied according to their predicted function as determined 
by their locations relative to known genes in the reference 
assembly. Variants that fall within the coding sequence of 
a known gene are further described as 'non-synonymous' 
if they result in a change to the associated amino acid 
or 'synonymous' if they do not. A variant may have sev- 
eral such designations; for purposes of our analysis, we 
confine attention to each variant's primary designation. 
Those observed among the SNPs included in our anal- 
ysis are 'unknown', 'coding-synon', 'intron', 'near-gene- 
s', 'near-gene-5', 'nonsense', 'missense', 'untranslated-3', 
and 'untranslated-5'. Given the small number (n = 5) of 
nonsense variants, we created a coding-nonsynonymous' 
designation by combining the 'missense' and 'nonsense' 
categories; similarly, we combined the 'untranslated-3', 
and 'untranslated-5' designations into the category 
'untranslated'. 

We investigated the PhyloP evolutionary conservation 
scores [32] applied to 28- and 44-species alignments, 
and to those alignments restricted to the placental mam- 
mals and human, for their ability to predict the disease 
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association status of common variants. Each of the four 
relevant Genome Browser tables provides the sum of the 
score, its sum of squares and the number of nucleotides 
that contribute to these statistics within ranges of con- 
tiguous nucleotides. We calculated a standardized score 
(mean divided by standard deviation) for each range and 
alignment and assigned these values to SNPs within the 
range. The PhyloP conservation scores exhibited pair- 
wise correlations of up to 0.984. The top four, six, and 
nine PCs explain 90%, 96%, and 99% of the variabil- 
ity, respectively, in the 12 variables. The top four PCs 
were used. Additional file 1: Figure S4 and S5, respec- 
tively, depict a heat map of the correlation matrix and 
a dendrogram depicting an average linkage, one-minus- 
absolute-correlation-distance clustering of the 12 PhyloP 
variables used to construct the PCs. 

The Database of Genomic Variants (DGV) [33,34] is 
a compilation of reported genomic alterations spanning 
more than 1000 bases (>100 in the case of indels) 
observed in healthy subjects. We formed three vari- 
ables indicating, respectively, whether (=1) or not (=0) 
each SNP falls in a region of copy number variation, a 
region containing insertions or deletions or a region with 
inversions. 

The ENCODE Project [19,20] is an ambitious project 
to identify and characterize the various functional ele- 
ments present in the human genome sequence and to 
facilitate public access to the data it generates; its over- 
arching objective is to improve our knowledge of human 
disease processes by providing a more comprehensive 
understanding of human molecular biology. Application 
of ENCODE functional annotation data to the design, 
analysis and interpretation of GWAS studies is one way 
in which ENCODE data can quickly be put to use to 
shed light on human disease processes [20]. To this 
end, we examine the utility of the recently released 
ENCODE regulation super-track data available from, and 
displayed on, the Genome Browser for a priori predic- 
tion of functional, disease-associated variants. In par- 
ticular, we include variables (see below) summarizing: 
transcription levels assayed in six cell lines by RNA- 
seq [21,22] and represented as normalized read den- 
sity; density of sequence tags for H3K4Mel (Histone H3 
Lysine 4 monomethylation) associated with enhancer and 
promoter activity measured in eight cell lines, similarly 
coded measures of promoter- and enhancer-associated 
H3K27Ac (Histone H3 Lysine 27 acetylation) in eight 
cell lines and of promoter-associated H3K4Me3 (Histone 
H3 Lysine 4 tri-methylation) in nine cell lines [23,24]; 
evidence for the variant falling within a DNasel hypersen- 
sitivity cluster [25,26]; and the evidence for transcription 
factor binding measured via ChlP-seq [27-31]. 

The Broad ChlP-seq, Caltech RNA-seq, and PhyloP 
signal tracks are summarized at the level of genomic bins. 



The ChlP-seq signals are measured within 118,084 con- 
tiguous bins of 25,600 bases apiece. The RNA-seq and 
PhyloP signals are measured in sets of non-overlapping, 
non-uniform bins. Bins are indexed according to the hier- 
archical scheme described in [57]. The ENCODE database 
provides basic summary statistics of the signal densi- 
ties measured within each bin. These are: the number of 
valid data points, the minimum, the range, the sum and 
the sum-of-squares. Sums are non-negative and range 
across six orders of magnitude; in bins where the sum was 
zero, we set it to one. From these summaries we derived 
the mean, the ratio of the mean to the standard deviation 
('z') and the maximum for each bin. We interpreted these 
as bin-level measures of overall signal strength, signal 
consistency and peak signal, respectively. We transformed 
each of these measures to the log scale prior to analysis. 
Finally, we include an indicator variable for when a variant 
falls in a Caltech RNA-seq bin with zero signal. 

The three Broad types (signal enrichment for H3K4 
Mel, H3K27Ac and H3k4Me3 histone modifications) 
comprise data on eight, eight and nine cell lines, respec- 
tively. We found significant pairwise correlations among 
the 75 variables (25 each of log(mean), log(maximum), 
and log(z)), ranging as high as 0.949, and therefore con- 
ducted a principal components analysis to identify the 
linear combinations, i.e. principal components (PCs), that 
explain most of the variability in the data. The top 18, 
27 and 44 PCs explain 90%, 95% and 99% of variabil- 
ity, respectively, in the 75 measures. Finally, we mapped 
each SNP to the appropriate Broad bin and annotated 
each with the top 18 PCs for purposes of the analysis. 
Additional file 1: Figure SI and S2, respectively, depict 
a heat map of the correlation matrix and a dendro- 
gram depicting an average linkage, one-minus-absolute- 
correlation-distance clustering of the 75 Broad ChlP-seq 
variables used to construct the PCs. 

The Caltech tables comprise RNA-seq raw signal 
enrichment data on six cell lines. Pairwise correlations 
among the 24 variables ranged as high as 0.970. The top 
11, 12, and 16 PCs explain 92%, 95%, and 99% of the 
variability. The top 11 PCs were used in the analysis. 
Additional file 1: Figure S7 and S8, respectively, depict 
a heat map of the correlation matrix and a dendrogram 
depicting an average linkage, one-minus-absolute- 
correlation- distance clustering of the 24 Caltech 
RNA-seq variables used to construct the PCs. 

The transcription factor ChlP-seq data are summarized 
by scores, ranging from 6 to 1000, measuring strength of 
evidence for binding within specified, sometimes overlap- 
ping, chromosomal bins (clusters'). We summarize these 
data as they apply to each SNP using two variables: the 
number of clusters it intersects with ('TFBSfreq') and the 
average log e (score) ('logTFBS') assigned to those clusters 
(coded as 0 if the SNP does not intersect with a cluster). 
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Similarly, the DNasel hypersensitivity data are summa- 
rized by scores, ranging from 16 to 1000, within specified 
chromosomal bins (clusters'). We summarize these data 
as they apply to each SNP using (1) an indicator for the 
variant falling within a clusters and (2) the log e (score) 
assigned to that cluster. 

The Open REGulatory ANNOtation database (OReg- 
Anno) [58,59] is a curated collection of regulatory ele- 
ments. The Genome Browser ORegAnno table provides 
start and stop coordinates and annotations for elements in 
the database. For purposes of our analysis, we summarize 
these data with an indicator variable for whether or not a 
variant falls within an ORegAnno regulatory region. 

PolyPhen-2 (PPh2) [35] assigns to nonsynonymous 
SNPs a probability of being damaging based on the 
sequence, phylogenetic and structural information char- 
acterizing the amino acid substitution. 

RegulomeDB [36] annotates SNPs with known and pre- 
dicted regulatory elements in the intergenic regions of 
the human genome. Each SNP is assigned one of seven 
categories based on its likelihood of affecting protein 
binding. 

We collected the annotation variables described above 
for each SNP in the HapMap Release 27 tables into an 
annotations data base that is available on our web site. In 
what follows, we refer to the collection of these variables 
as F; when F is indexed we are referring to the functional 
data for a given SNP (F s ) or, as just below, to a given SNP, 
s, in a given LD block, b (F s fo). 

Model 

For purposes of the analysis, we assumed that blocks and 
the SNPs within the blocks were independent conditional 
on the functional data. We modeled the probability that 
a SNP s in block b was an associated SNP, n s b, given 
the functional data for that SNP, F s f,, using the logistic 
regression model logit(7T s fo) = a + F s bfi. We assumed 
that there was at least one associated SNP in each case 
block and that there were no associated SNPs in con- 
trol blocks. Hence, each case block contributed the factor 
[l — Y\"=l (1 — n sb)] to the likelihood, while each control 
block contributed YuLi (1 — n sb)- As a result, we expected 
at least 1,675 of the 48,888 SNPs in the training set to 
be phenotype associated. This corresponds to a = —3.34 
(columns of F are centered); if 10% (alt 20%) of case blocks 
contain two phenotype associated SNPs, a = —3.24 (alt 
-3.15). Hence the normal mean -3.24, standard deviation 
0.1 prior distribution we placed on a is consistent with 
our expectation that there were fewer than 2,178 (= 1.3 x 
1, 675) true phenotype associated variants among the case 
blocks. 

Our specification of the prior distribution on f3 was 
guided by the observation that, in the normal model with 
the normal-exponential-gamma (NEG) distribution as 



prior on the mean and the variance known, the poste- 
rior mode is identically zero when the maximum likeli- 
hood estimator (MLE) is in a neighborhood around zero, 
but rapidly converges to the MLE as the MLE diverges 
from zero (this setting approximates the more general 
one in which the NEG distribution is used as the prior 
distribution for a parameter whose likelihood is approxi- 
mately normal). The NEG distribution is specified by its 
shape and scale parameters and the width of the thresh- 
old neighborhood is a function of these parameters. For 
purposes of our analysis, we chose parameter values for 
which no more than 10% of the coefficients are outside 
of the threshold region with probability 0.90, a priori. We 
placed independent NEG prior distributions on the com- 
ponents of /J; in addition, we also considered the model 
with independent standard normal distributions on the 
components of /J. Inference for each of these models was 
carried out using the training set and were evaluated using 
the evaluation set. 

We used random-walk Markov chain Monte Carlo 
(MCMC) algorithms [60,61] to estimate summaries of 
the posterior distribution under each of the models. We 
started 10 independent chains per model from starting 
points drawn from the prior distribution. In each case, 
step sizes were adjusted so that parameter level accep- 
tance ratios fell between 0.3 and 0.5 during an initial, 
'burn-in' set of iterations not used for inference. We fixed 
the step sizes and ran the 10 chains from their leave- 
off positions for an additional 50,000 iterations per chain. 
Inspection of trace plots, as well as computation of the 
Gelman-Rubin [62], Heidelberger- Welch [63], Raftery- 
Lewis [64], and Geweke [65] diagnostics implemented 
in the CODA package [66] in R [67], indicated satisfac- 
tory convergence. We thinned the 10 chains by 1,000 and 
combined them to produce a sample of 500 coefficient 
vectors. 

We used the concordance index (CI) to measure the 
out-of-sample predictive accuracy of the model. We cal- 
culated the CI as the fraction of matched pairs in the 
'evaluation set' in which the average probability of asso- 
ciation given the functional data over the «i SNPs in the 
case block {'bl') was larger than the corresponding aver- 
age over the wn SNPs in the matched control block ('b0'); 
i.e. if 

Ml no 

J2 Vr( - As > bl \ F s,bl)/ n l > Pr ( A s,bo I F s>b0 )/n 0 , 
s=l s=l 

where we estimated Vr(A Si f, n \ F SyBn ) by 

500 

J^MAsM = l\F sM ,ai,pi)/500, 

i=l 

where the a/ and fa are MCMC samples saved from 
analysis of the training data. 
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Evaluation 

We carried out a genome-wide association analysis of 
serous ovarian cancer using the methods described above. 
The data for this analysis were drawn from GWAS studies 
conducted in the US [68] and the UK [45]. The genotype 
data from these studies were combined and imputed to 
HapMap III density, resulting in a data set comprising ana- 
lyzable genotypes at 2,500,004 SNPs for 7,272 subjects of 
European ancestry. The association analysis was confined 
to the 2,004 cases with advanced stage serous ovarian can- 
cer and the 3,272 available controls and was adjusted for 
study site and the first two principal components of the 
sample genotypes. We calculated Bayes factors (BFs) as 
described above and set the prior probability of associa- 
tion to be 0.00001 when estimating posterior probabilities; 
ranks are invariant to this choice. 

Several large scale studies conducted to follow up 
promising associations from these GWAS have identi- 
fied the eleven genome-wide significant loci listed in 
Table 4. We treat these as established or 'true positive' 
associations for purposes of evaluating the various asso- 
ciation measures. In addition, we identified a set of likely 
unassociated, 'true negative' SNPs from among 22,254 
GWAS followup SNPs placed on the iCOGS chip [48]. 
This analysis included 8,344 cases with advanced stage 
serous ovarian cancer and 22,913 controls of European 
ancestry and was adjusted for study site and the first five 
European ancestry principal components. We identified a 
subset of 5,155 SNPs with strong evidence against associa- 
tion (defined as BF<0.1 on Jeffreys' scale of evidence [51]) 
to serve as the 'true negatives'. 

We compared the rankings of these two sets of SNPs in 
the original GWAS analysis when association was mea- 
sured using genotype data only to those obtained with 
incorporation of the functional signatures. We compared 
the procedures based on their power to identify the 
truly associated variants for follow-up assuming budgets 
allowing for evaluation of the top 5,000 or 10,000 SNPs. 

In most association studies, genotypes are determined, 
through a combination of genotyping and imputation, for 
only a subset of the universe of variants. In this setting, 
it is standard to rely on correlations between genotyped 
variants ('tags') and those that are 'tagged' (not genotyped) 
to identify and localize associations. Likewise, the utility 
of functional signatures in a typical study will depend on 
the degree to which they reflect the likelihood of func- 
tion of both the tag for which it is calculated and for the 
set of variants it tags. We evaluated correlations between 
functional signatures, defined as {F s f3), for adjacent pairs 
of SNPs included in the ovarian cancer GWAS analysis. 
We identified the quantile of each adjacent variant pair 
in the overall distribution of distances measured in base 
pairs (BPs). For purposes of this analysis, we defined quan- 
tiles in increments of 0.025, i.e. with each containing 2.5% 



of the mass of the distance distribution. We estimated the 
Pearson correlation between the functional signatures of 
the adjacent SNP pairs within each quantile and plotted 
these estimates against BP distance, locating the estimates 
at the midpoints of the quantile bins. 

Additional file 



Additional file 1 : Supplemental table and figures. 
Competing interests 

The authors declare that they have no competing interests. 
Authors' contributions 

El conceived the study, coordinated its design, the analysis and drafting of the 
manuscript. GL carried out statistical and bioinformatic analyses and 
participated in drafting the manuscript. MC contributed to study design, the 
analysis plan and the manuscript. AM contributed to study design, 
interpretation of results and the manuscript. All authors have read, critically 
reviewed and approved its final version. 

Acknowledgements 

This work was funded by the National Institutes of Health through its Genes 
and Environment Initiative, grant number R01-HL090559 and through 
R21 -ES020796 (NIEHS) and by the National Science Foundation via 
DMS-1 1 06891. The ovarian cancer multi-GWAS analyzed herein were 
provided by the FOCI project (1 U1 9-CA1 481 1 2) of NCI's GAME-ON 
consortium. The authors wish to thank two anonymous referees for their 
important contributions to the manuscript. 

Author details 

1 Department of Statistical Science, Duke University, Box 90251, 27708-0251 
Durham, NC, USA. 2 Cancer Epidemiology Program, H. Lee Moffitt Cancer 
Center & Research Institute, 12902 Magnolia Drive, 3361 2 Tampa, FL, USA. 

Received: 20 December 2013 Accepted: 13 May 2014 
Published: 24 May 2014 

References 

1. ManolioTA: Genomewide association studies and assessment of the 
risk of disease. NEngl J Med 201 0, 363(2):1 66-1 76. doi:l 0.1 056/ 
NEJMra0905980. PMID20647212. http://www.nejm.org/doi/pdf/10.1056/ 
NEJMra0905980. 

2. Freedman ML, Monteiro ANA, Gayther SA, Coetzee GA, Risch A, Plass C, 
Casey G, Biasi MD, Carlson C, Duggan D, James M, Liu P, Tichelaar JW, 
Vikis HG, You M, Mills IG: Principles for the post-GWAS functional 
characterization of cancer risk loci. Nat Genet 201 1,43(6)513-5 18. 
doi:10.1038/ng.840. 

3. Witte JS, Greenland S, Haile RW, Bird CL: Hierarchical regression 
analysis applied to a study of multiple dietary exposures and breast 
cancer. Epidemiology 1 994, 5(6):61 2-621 . 

4. Aragaki CC, Greenland S, Probst-Hensch N, Haile RW: Hierarchical 
modeling of gene-environment interactions: estimating NAT2 
genotype-specific dietary effects on adenomatous polyps. Cancer 
Epidemiol Biomarkers & Prev 1997, 6(5):307-314. http://cebp.aacrjournals. 
org/content/6/5/307.full.pdf+html. 

5. Hung RJ, Brennan P, Malaveille C, Porru S, Donato F, Boffetta P, Witte JS: 
Using hierarchical modeling in genetic association studies with 
multiple markers: application to a case-control study of bladder 
cancer. Cancer Epidemiol Biomarkers & Prev 2004, 1 3(6):1 01 3-1 021 . 

6. Hung RJ, Baragatti M, Thomas D, McKay J, Szeszenia-Dabrowska N, 
Zaridze D, Lissowska J, Rudnai P, Fabianova E, Mates D, Foretova L, Janout 
V, Bencko V, Chabrier A, Moullan N, Canzian F, Hall J, Boffetta P, Brennan P: 
Inherited predisposition of lung cancer: a hierarchical modeling 
approach to DNA repair and cell cycle control pathways. Cancer 
Epidemiol Biomarkers & Prev 2007, 1 6(1 2):2736-2744. 



Iversen etal. BMC Genomics 2014, 15:398 
http://www.biomedcentral.com/1471-2164/15/398 



Page 15 of 16 



7. Veyrieras JB, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK: 
High-resolution mapping of expression-QTLs yields insight into 
human gene regulation. PLoS Genet 2008, 4(1 0):1 00021 4. 

doi:l 0.1 371/journal.pgen. 1000214. 

8. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, 
Manolio TA: Potential etiologic and functional implications of 
genome-wide association loci for human diseases and traits. Proc 
Nat Acad Sci 2009, 106(23)3362-9367. doi:10.1073/pnas.0903103106. 
http://www.pnas.Org/content/1 06/23/9362.full.pdf+html. 

9. Lee SI, Dudley AM, Drubin D, Silver PA, Krogan NJ, Peer D, Koller D: 
Learning a prior on regulatory potential from eQTL data. PLoS Genet 
2009, 5(1 ):1 000358. doi:1 0.1 371 /journal.pgen.1 000358. 

1 0. Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ: 
Trait-associated SNPs are more likely to be eQTLs: Annotation to 
enhance discovery from GWAS. PtoS Genet 201 0, 6(4):1 000888. 
doi:l 0.1 371/journal.pgen. 1 000888. 

11 An integrated encyclopedia of DNA elements in the human 
genome. Nature 2012, 489:57-74. doi:l 0.1 038/naturel 1247. 

1 2. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M: Linking disease 
associations with regulatory information in the human genome. 
Genome Res 201 2, 22(9):1 748-1 759. doi:10.1 1 01/gr.l 361 27.1 1 1. http// 
genome.cshlp.org/content/22/9/1748.full.pdf+html. 

1 3. Carbonetto P, Stephens M: Integrated enrichment analysis of variants 
and pathways in genome-wide association studies indicates central 
role for IL-2 signaling genes in type 1 diabetes, and cytokine 
signaling genes in Crohn's disease. PLoS Genet 201 3, 9(1 0):1 003770. 
doi:l 0.1 371/journal.pgen. 1003770. 

14 Pickrell JK: Joint analysis of functional genomic data and 

genome-wide association studies of 1 8 human traits. 201 4. arXiv 

1311.4843 [q-bio.GN] 
1 5. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint 

method for genomewide association studies by imputation of 

genotypes. Nat Genet 2007, 39:906-91 3. 
1 6 Servin B, Stephens M: Imputation-based analysis of association 

studies: Candidate regions and quantitative traits. PLoS Genet 2007, 

3(7):1 14. doi:l0.l371/journal.pgen.00301 14. 
1 7. Wakefield J: A Bayesian measure of the probability of false discovery 

in genetic epidemiology studies. Am J Hum Genet 2007, 81 (2):208-227. 

doi:l0.l086/519024. 
1 8 Wakefield J: Bayes factors for genome-wide association studies: 

comparison with p-values. Genet Epidemiol 2009, 33(l):79-86. 

doi:10.1002/gepi.20359. 

1 9. The ENCODE Project Consortium: Identification and analysis of 
functional elements in 1 % of the human genome by the ENCODE 
pilot project. Nature 2007, 447(71 46):799-81 6. 

20. The ENCODE Project Consortium: A user's guide to the encyclopedia 
of DNA elements (ENCODE). PLoS Biology 201 1 , 9(4): 1 001 046. 

doi:l 0.1 371/journal.pbio.1 001 046. 

21 . Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and 
quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 
2008, 5(7):621 -628. doi:1 0.1 038/nmeth.l 226. 

22. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and 
memory-efficient alignment of short DNA sequences to the human 
genome. Genome Biol 2009, 1 0(3):25. 

23. Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, 
Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Fell R, Schreiber SL, 
Lander ES: A bivalent chromatin structure marks key developmental 
genes in embryonic stem cells. Cell 2006, 1 25(2):31 5-326. 

24. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez 
P, Brockman W, Kim T-K, Koche RP, Lee W, Mendenhall E, O'Donovan A, 
Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, 
Lander ES, Bernstein BE: Genome-wide maps of chromatin state in 
pluripotent and lineage-committed cells. Nature 2007, 
448(7153)553-560. doi:10.1038/nature06008. 

25. Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, 
Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, 
Humbert R, Singer MA, Richmond TA, Dorschner MO, McArthur M, 
Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA: 
Genome-scale mapping of DNase I sensitivity in vivo using tiling 
DNA microarrays. Nat Methods 2006, 3(7)5 11-518. 
doi:10.1038/nmeth890. 



26. Sabo PJ, Hawrylycz M.Wallace JC, Humbert R, Yu M, Shafer A, Kawamoto 
J, Hall R, Mack J, Dorschner MO, McArthur M, Stamatoyannopoulos JA: 
Discovery of functional noncoding elements by digital analysis of 
chromatin structure. Proc Nat Acad Sci 2004, 101 (48): 1 6837-6842. 

27. Euskirchen G, RoyceTE, Bertone P, Martone R, Rinn JL, Nelson FK, Sayward 

F, Luscombe NM, Miller P, Gerstein M, Weissman S, Snyder M: CREB binds 
to multiple loci on human chromosome 22. Mol Cell Biol 2004, 
24(9):3804-3814. 

28. Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, 
Emanuelsson O, Stole V, Weissman S, Gerstein MB, Ruan Y, Snyder M: 
Mapping of transcription factor binding regions in mammalian cells 
by ChIP: comparison of array- and sequencing-based technologies. 
Genome Res 2007, 1 7(6):898-909. 

29. Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, 
Rinn JL, Nelson FK, Miller P, Gerstein M, Weissman S, Snyder M: 
Distribution of nf-it b-binding sites across human chromosome 22. 
Proc Nat Acad Sci USA 2003, 1 00(21 ):1 2247-1 2252. 

30. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen 

G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, 
Snyder M, Jones S: Genome-wide profiles of ST ATI DNA association 
using chromatin immunoprecipitation and massively parallel 
sequencing. Nat Methods 2007,4(8):651-657. doi:10.1038/nmeth1068. 

31. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, 
Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring 
of ChlP-seq experiments relative to controls. Nat Biotech 2009, 

27(1 ):66-75. doi:1 0.1 038/nbt.l 51 8. 

32. Siepel A, Pollard K, Haussler D: New methods for detecting 
lineage-specific selection. Res Computat Mol Biol 2006, 3909:1 90-205. 

33. lafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, 
Lee C: Detection of large-scale variation in the human genome. Nat 
Genet 2004, 36(9):949-951 . 

34. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW: Development of 
bioinformatics resources for display and analysis of copy number 
and other structural variants in the human genome. Cytogenet& 
Genome Res 2006, 1 1 5(3/4)205-2 1 4. 

35. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, 
Kondrashov AS, Sunyaev SR: A method and server for predicting 
damaging missense mutations. Wot Methods 2010, 7:248-249. 

36. Boyle A, Hong E, Hariharan M, Cheng Y, Schaub M, Kasowski M, 
Karczewski K, Park J, Hitz B, Weng S, Cherry J, Snyder M: Annotation of 
functional variation in personal genomes using RegulomeDB. 
Genome Res 201 2, 22(9):1 790-1 797. doi:1 0.1 1 01/gr.l 37323.1 1 2. 

37. Hans CM: Bayesian lasso regression. Biometrika 2009, 96:835-845. 

38. Richardson S, Bottolo L, Rosenthal JS: Bayesian models for sparse 
regression analysis of high dimensional data. In Bayesian Statistics 9. 
Edited by Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, 
Smith AFM. Oxford: Oxford University Press; 201 1 . 

39. Griffin JE, Brown PJ: Bayesian adaptive lassos with non-convex penalization. 
Technical report, University of Kent, 2007. 

40. Hoggart CJ, Whittaker JC, De lorio M, Balding DJ: Simultaneous analysis 
of all SNPs in genome-wide and re-sequencing association studies. 
PtoS Genet 2008, 4(7):1 0001 30. doi:1 0.1 371/journal.pgen. 1 0001 30. 

41 . Griffin JE, Brown PJ: Inference with normal-gamma prior distributions 
in regression problems. Bayesian Anal 2010, 5(1 }:1 71 — 1 88. 

42. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, 
Reynolds AP, Sandstrom R, Qu H, Brody J, Shafer A, Neri F, Lee K, Kutyavin 
T, Stehling-Sun S, Johnson AK, Canfield TK, Giste E, Diegel M, Bates D, 
Hansen RS, Neph S, Sabo PJ, Heimfeld S, Raubitschek A, Ziegler S, 
Cotsapas C, Sotoodehnia N, Glass I, Sunyaev SR, et al: Systematic 
localization of common disease-associated variation in regulatory 
DNA. Science 201 2, 337(6099):1 190-1 195. doi:l 0.1 1 26/science.l 222794. 
http://www.sciencemag.Org/content/337/6099/l 1 90.full.pdf. 

43. Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen C-A, Schmitt AD, Espinoza 
CA, Ren B: A high-resolution map of the three-dimensional 
chromatin interactome in human cells. Nature 201 3, 503:290-294 
doi:l 0.1 038/naturel 2644. 

44. Sanyal A, Lajoie BR, Jain G, Dekker J: The long-range interaction 
landscape of gene promoters. Nature 2012, 489:109-1 13. 
doi:l 0.1 038/naturel 1279. 

45. Song H, Ramus SJ.Tyrer J, Bolton KL, Gentry-Maharaj A, Wozniak E, 
Anton-Culver H, Chang-Claude J, Cramer DW, DiCioccio R, DorkT, Goode 



Iversen etal. BMC Genomics 2014, 15:398 
http://www.biomedcentral.eom/1 471 -2 1 64/1 5/398 



Page 16of 16 



EL, Goodman MT, Schildkraut JM, Sellers T, Baglietto L, Beckmann MW, 
Beesley J, Blaakaer J, Carney ME, Chanock S, Chen Z, Cunningham JM, 
Dicks E, Doherty JA, Durst M, Ekici AB, Fenstermacher D, Fridley BL, Giles G, 
et al.: A genome-wide association study identifies a new ovarian 
cancer susceptibility locus on 9p22.2. Nature Genet 2009, 42:996-1 000. 
doi:10.l038/ng.424. 

46. Bolton KL, Tyrer J, Song H, Ramus SJ, Notaridou M, Jones C, SherT, 
Gentry-Maharaj A, Wozniak E, Tsai Y-Y, Weidhaas J, Paik D, Van Den Berg 
DJ, Stram DO, Pearce CL, Wu AH, Brewster W, Anton-Culver H, Ziogas A, 
Narod SA, Levine DA, Kaye SB, Brown R, Paul J, Flanagan J, Sieh W, 
McGuire V, Whittemore AS, Campbell I, Gore ME, et al.: Common 
variants at 1 9p1 3 are associated with susceptibility to ovarian 
cancer. Nat Genet 201 0, 42:880-884. 

47. Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, 
Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kruger-Kjaer S, 
Birrer MJ, Berchuck A, Schildkraut J, Tomlinson I, Kiemeney LA, Cook LS, 
Gronwald J, Garcia-Closas M, Gore ME, Campbell I, Whittemore AS, 
Sutphen R, Phelan C, Anton-Culver H, Pearce CL, Lambrechts D, Rossing 
MA, Chang-Claude J, Moysich KB, Goodman MT, et al.: A genome-wide 
association study identifies susceptibility loci for ovarian cancer at 
2q3 1 and 8q24. Nat Genet 201 0, 42:874-879. doi:1 0.1 038/ng.668. 

48. Pharoah PDP.Tsai Y-Y, Ramus SJ, Phelan CM, Goode EL, Lawrenson K, 
Buckley M, Fridley BL, Tyrer JP, Shen H, Weber R, Karevan R, Larson MC, 
Song H, Tessier DC, Bacot F, Vincent D, Cunningham JM, Dennis J, Dicks E, 
Aben KK, Anton-Culver H, Antonenkova N, Armasu SM, Baglietto L, 
Bandera EV, Beckmann MW, Birrer MJ, Bloom G, Bogdanova N, et al.: 
GWAS meta-analysis and replication identifies three novel 
susceptibility loci for ovarian cancer. Nat Genet 201 3, 45:362-370. 
doi:10.1038/ng.2564. 

49. Bojesen SE, Pooley KA, Johnatty SE, Beesley J, Michailidou K, Tyrer JP, 
Edwards SL, Pickett HA, Shen HC, Smart CE, Hillman KM, Mai PL, 
Lawrenson K, Stutz MD, Lu Y, Karevan R, Woods N, Johnston RL, French 
JD, Chen X, Weischer M, Nielsen SF, Maranian MJ, Ghoussaini M, Ahmed S, 
Baynes C, Bolla MK, Wang Q, Dennis J, McGuffog L: Multiple 
independent variants at the TERT locus are associated with 
telomere length and risks of breast and ovarian cancer. Nat Genet 
201 3, 45:371-384. doi:l 0.1 038/ng.2566. 

50. Permuth-Wey J, Lawrenson K, Shen HC, Velkova A, Tyrer JP, Chen Z, Lin 
H-Y, Ann Chen Y, Tsai Y-Y, Qu X, Ramus SJ, Karevan R, Lee J, Lee N, Larson 
MC, Aben KK, Anton-Culver H, Antonenkova N, Antoniou AC, Armasu SM, 
Bacot F, Baglietto L, Bandera EV, Barnholtz-Sloan J, Beckmann MW, Birrer 
MJ, Bloom G, Bogdanova N, Brinton LA, Brooks-Wilson A, et al.: 
Identification and molecular characterization of a new ovarian 
cancer susceptibility locus at 17q21.31. NatCommun 2013, 4:1627 
doi:10.l038/ncomms26l3. 

51 . Jeffreys H: Theory of Probability. 3rd edn. p. 459. Oxford: Oxford Univ. Press; 
1961. 

52. The 1000 Genomes Project Consortium: An integrated mapof genetic 
variation from 1 ,092 human genomes. Nature 201 2, 491 :56-65. 

53. Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc 1 995, 90:773-795. 

54. Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: 
Supplement to "Bayesian model search and multilevel inference for 
SNP association studies". Ann Appl Stat 201 0, 4(3):1 342-1 364. 

55. Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans 
M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, 
Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, DreszerTR, 
Clawson H, Barber GP, Haussler D, Kent WJ: The UCSC genome browser 
database: update 201 0. Nucleic Acids Res 201 0, 38(suppl 1 ):61 3-61 9. 
doi:l 0.1 093/nar/gkp939. http://nar.oxfordjournals.org/content/38/ 
suppl_l/D613.full.pdf+html. 

56. Sherry S, Ward M, Kholodov M, Baker J, Phan L, Smigielski E, Sirotkin K: 
dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001 , 
29(1)308-311. 

57. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler 
D: The human genome browser at UCSC. Genome Res 2002, 
12(6):996-1006. 

58. Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, 
Pleasance ED, Prychyna Y, Zhang X, Jones SJM: ORegAnno: an open 
access database and curation system for literature-derived 
promoters, transcription factor binding sites and regulatory 
variation. Bioinformatics 2006, 22(5):637-640. 



59. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, 
Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, 
Hooghe B, Van Loo, P, Blanco E, Ticoll A, LithwickS, Portales-Casamar E, 
Donaldson IJ, Robertson G, Wadelius C, De Bleser, P, Vlieghe D, Halfon MS, 
Wasserman W, Hardison R, Bergman CM, Jones SJM, The Open Regulatory 
Annotation Consortium: ORegAnno: an open-access community- 
driven resource for regulatory annotation. Nucleic Acids Res 2008, 
36(suppl 1):107-113, 

60. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: 
Equations of state calculations by fast computing machines. J Chem 
Phys 1953,21:1087-1091. 

61. Gilks WR, Richardson S, Spiegelhalter DJ: Introducing Markov chain 
Monte Carlo. In Markov Chain Monte Carlo in Practice. Edited by Gilks WR, 
Richardson S, Spiegelhalter DJ. London: Chapman and Hall; 1 996. 

62. Gelman A, Rubin DB: Inference from iterative simulation using 
multiple sequences (with discussion). StatSci 1992, 7:457-51 1. 

63. Heidelberger P, Welch P: Simulation run length control in the 
presence of an initial transient. Oper Res 1 983, 31 :1 1 09-1 1 44. 

64. Raftery AE, Lewis SM: Implementing MCMC. In Markov Chain Monte 
Carlo in Practice. Edited by Gilks WR, Richardson S, Spiegelhalter DJ. 
London: Chapman and Hall; 1 996:1 1 5-1 27. 

65. Geweke J: Evaluating the accuracy of sampling-based approaches to 
calculating posterior moments. In Bayesian Statistics 4. Edited by 
Bernado J, Erger J, AP D, Smith A. Oxford, UK: Clarendon Press; 1992. 

66. Plummer M, Best N, Cowles K, Vines K: CODA: Output Analysis and 
Diagnostics for MCMC. 201 0. R package version 0.1 3-5. http://CRAN.R- 
project.org/package=coda. 

67. Ihaka R, Gentleman R: R: A language for data analysis and graphics. 
J Comput Graph Stat 1 996, 5(3):299-3 1 4. 

68. Permuth-Wey J, Kim D.Tsai Y-Y, Lin H-Y, Chen YA, Barnholtz-Sloan J, Birrer 
MJ, Bloom G, Chanock SJ, Chen Z, Cramer DW, Cunningham JM, Dagne G, 
Ebbert-Syfrett J, Fenstermacher D, Fridley BL, Garcia-Closas M, Gayther SA, 
Ge W, Gentry-Maharaj A, Gonzalez-Bosquet J, Goode EL, Iversen E, Jim H, 
Kong W, McLaughlin J, Menon U, Monteiro ANA, Narod SA, Pharoah PDP, 
et al.: LIN28B polymorphisms influence susceptibility to epithelial 
ovarian cancer. Cancer Res 201 1, 71(1 1)3896-3903. 
doi:10.1158/0008-5472.CAN-10-4167. http://cancerres.aacrjournals.org/ 
content/71/ll/3896.full.pdf+html. 

doi:1 0.1 1 86/1471-21 64-1 5-398 

Cite this article as: Iversen et al:. Functional annotation signatures of 
disease susceptibility loci improve SNP association analysis. BMCGenomics 
201415:398. 

V y 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



o 



BioMed Central 



